This vignette introduces scBubbletree, a transparent methodology for single cell RNA-seq data exploration based on well established methods clustering and visualization. We will demonstrate the functionality of scBubbletree by analyzing three scRNA-seq datasets (Case studies A, B and C), while also showing how to integrate scBubbletree with existing pipelines for scRNA-seq analysis e.g. based on Seurat.
To run this vignette we need several R-packages. Load them now:
source(file = "../Rutil/Graphics.R")
source(file = "../scBubbletree/R/util.R")
source(file = "../scBubbletree/R/main.R")
source(file = "../scBubbletree/R/annotation.R")
#library(scBubbletree, lib.loc = lib.loc)
library(cluster, lib.loc = lib.loc)
library(Seurat, lib.loc = lib.loc)
library(ggplot2, lib.loc = lib.loc)
library(reshape2, lib.loc = lib.loc)
library(parallel, lib.loc = lib.loc)
library(ape, lib.loc = lib.loc)
library(ggtree, lib.loc = lib.loc)
library(org.Hs.eg.db, lib.loc = lib.loc)
library(bluster, lib.loc = lib.loc)
library(SummarizedExperiment, lib.loc = lib.loc)
library(ggrepel, lib.loc = lib.loc)
In this case study we will analyze scRNA-seq mixture of 3,918 cells derived from 5 adenocarcinoma cell lines: H2228, H1975, HCC827, H838 and A549
Cell types (ground truth) were inferred for each cell on the basis of known genetic variation with demuxlet. We will use this simple dataset to demonstrate the advantages of quantitative scRNA-seq data exploration with scBubbletree in combination with Seurat preprocessing.
Load the data and perform some basic scRNA-seq data processing steps with Seurat:
# Lets load the benchmark data
load(file = "raw_data/Tian_2019/sc_mixology-master/data/sincell_with_class_5cl.RData")
# We are only interested in the 10x data object 'sce_sc_10x_5cl_qc'
d <- sce_sc_10x_5cl_qc
# Remove the remaining objects
rm(sc_Celseq2_5cl_p1, sc_Celseq2_5cl_p2, sc_Celseq2_5cl_p3, sce_sc_10x_5cl_qc)
# Get the meta data for each cell
meta <- colData(d)[, c("cell_line_demuxlet", "non_mt_percent", "total_features")]
# Create Seurat object from the raw counts and append the meta data to it
d <- Seurat::CreateSeuratObject(counts = d@assays$data$counts,
project = '')
# check if all cells are matched between d and meta
# table(rownames(d@meta.data) == meta@rownames)
d@meta.data <- cbind(d@meta.data, meta@listData)
# cell type predictions are provided as part of the meta data
table(d@meta.data$cell_line)
# select 5,000 most variable genes
d <- Seurat::FindVariableFeatures(object = d,
selection.method = "vst",
nfeatures = 5000)
# Preprocessing with Seurat: SCT transformation + PCA + UMAP
d <- SCTransform(object = d)
d <- RunPCA(object = d, npcs = 50, features = VariableFeatures(object = d))
d <- RunUMAP(d, dims = 1:50)
Exploring 2D UMAPs is by now a standard first step in scRNA-seq data analysis. As this is a toy dataset composed of distinct cell lines, we should be able to interpret the resulting 2D UMAP without much trouble.
The 2D UMAP (left panel) appears to contain between 5 and 8 clusters of cells. After color-coding the cells according to their predicted cell types (right panel) we see the 5 clusters (cell lines), with some substructure also visible within the clusters.
While 2D UMAPs are intuitive, by themselves these maps fail to provide the user with quantitative information about the scRNA-seq data. For instance, due to the high overplotting it difficult to tell how many cells are found in the different clusters. Such basic information on the composition of the sample, we think, is essential to check whether the experiment has worked as planned.
Furthermore, techniques such as UMAP and t-SNE primarily focus on preserving local cell distances, which makes the interpretation of global distances challenging. This means that we cannot interpret distances between cells in the UMAP as distances in Euclidean space, which also implies that we cannot learn about the cell type structure in our data directly from the 2D UMAP or t-SNE plots.
All of these challenges will be exaggerated in more realistic scenarios, such as scRNA-seq data derived from complex (heterogeneous) tissues and also composed of 10- to 100-times higher number of cells. In such analyses we will see crowded 2D UMAPs composed of many clusters of cells that are not clearly separated from each other. We will encounter many of these challenges in Case Study B.
Using 2D UMAPs in publications presents a challenge not only for the readers but also for the reviewers, who have to evaluate these complex figures without having access to the raw data.
For a more quantitative and transparent exploration and visualization of scRNA-seq we developed scBubbletree.
As first input scBubbletree uses matrix \(A^{n\times f}\) which represents a low-dimensional projection of the original scRNA-seq data, with \(n\) rows as cells and \(f\) columns as low-dimension features. We will use the PCA data generated by Seurat as \(A\). In particular, we will use the first 15 principal components (PCs).
# This is the main input of scBubbletree -> matrix A
A <- d@reductions$pca@cell.embeddings[, 1:15]
# A has n=cells as rows, f=features as columns (e.g. from PCA)
dim(A)
FALSE [1] 3918 15
The scBubbletree algorithm performs these main operations:
kkAs scBubbletree uses the well known k-means clustering of \(A\) to identify clusters of transcriptionally similar cells, our first goal is to determine k, i.e. the number of clusters in the data. This can be achieved with the function get_k.
get_k performs k-means clustering B times (bootstrapping iterations) for a vector of k values specified by the parameter ks. For this we draw a sample of rows from \(A\) with replacement. The proportion of rows sampled is controlled by cv_clust_p.
get_k then computes three metrics for each k and B which allow us to find the optimal k:
The remaining parameters of get_k are explained below:
n_start and iter_max: used to tune k-means (see ?kmeans)cores: computer cores to be usedcv_index_p proportion of cells to be used for the computation of the gap statistic. For small samples (e.g. < 20,000 cells), you can use the default parameter. For larger samples even low cv_index_p=0.1 produces robust results while speeding up the computation. Smaller B (e.g. B=100) might be helpful to execute this function in several minutes less than one hour for large datasets.Lets run get_k now.
if(redo_case_a) {
# Determine appropriate number of clusters (k)
b <- get_k(B = 100,
cv_clust_p = 1,
cv_index_p = 1,
ks = 1:20,
x = A,
n_start = 10,
iter_max = 50,
cores = 20)
if(dir.exists("case_study_A")==F) {
dir.create("case_study_A")
}
save(b, file = "case_study_A/b.RData")
} else {
b <- get(load("~/scBubbletree/case_study_A/b.RData"))
}
Mean index values and 95% confidence intervals obtained from get_k are shown below. The raw data for each bootstrap iteration are also stored as part of the object b.
k is is easily discernible from each of the three metrics (panels).
The silhouette index peaks around k=5 (left panel) and for larger ks the silhouette index starts to drop and is numerically less stable. Hence, we can conclude that k values between 4 and 6 seem like a reasonable choice for this dataset, which is consistent with the 2D UMAP structure shown before. This conclusion is confirmed by the gap statistic (middle panel) and the elbow method based on WSS (right panel), i.e. we see a knee in the curves at k=5.
Lets start with the simplest model (segmentation) k=5 and use scBubbletree to perform clustering and use the clustering data to generate a dendrogram, showing the clusters as bubbles (tree leaves) in the dendrogram.
# Perform clustering to get data for scBubbletree
k5 <- get_bubbletree(x = A,
k = 5,
seed = 1234,
cores = 1,
B = 100,
N_eff = 100,
verbose = F,
n_start = 100,
iter_max = 100,
round_digits = 1,
show_branch_support = T)
FALSE Clustering ...
FALSE Bubbletree construction ...
Lets now plot the resulting dendrogram:
k5$tree
The generated dendrogram has k=5 bubbles (clusters) shown as tree leaves. The radius of each bubble is scaled as function of the the number of cells that belong to the corresponding cluster. Analogously, the bubbles are color-coded according to their sizes, i.e. dark bubbles are larger and have many cells, and bright bubbles are small and contain few cells. The absolute and relative cell frequencies in the different bubbles are shown as labels.
Bubble 3 is the largest (and darkest) one in the dendrogram and contains 1,253 cells (\(\approx\) 32% of all cells in the dataset). Bubble 5 is the smallest one (and brightest) and contains only 436 cells (\(\approx\) 11% of all cells in the dataset).
We can access the bubble data shown in the bubbletree with the following code:
knitr::kable(k5$tree_meta, digits = 1)
| label | c | n | p | pct | tree_order | |
|---|---|---|---|---|---|---|
| 5 | 5 | 436 | 3918 | 0.1 | 11.1 | 5 |
| 4 | 4 | 593 | 3918 | 0.2 | 15.1 | 4 |
| 3 | 3 | 1253 | 3918 | 0.3 | 32.0 | 3 |
| 1 | 1 | 761 | 3918 | 0.2 | 19.4 | 2 |
| 2 | 2 | 875 | 3918 | 0.2 | 22.3 | 1 |
The average distances between a pair of bubbles are represented by the sums of branch length in the tree. This information is included as part of the object k5:
# c_i, c_j = pair of clusters/bubbles
# M = mean inter-cluster dissimilarity
# L95/H95 = lower/upper bounds 95% confidence
# interval of the mean dissimilarity
knitr::kable(k5$pair_dist$hc_pair_dist, digits = 1)
| c_i | c_j | M | SE | L95 | H95 |
|---|---|---|---|---|---|
| 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2 | 86.7 | 0.1 | 86.5 | 86.8 |
| 1 | 3 | 84.5 | 0.1 | 84.2 | 84.7 |
| 1 | 4 | 84.1 | 0.1 | 83.9 | 84.3 |
| 1 | 5 | 84.1 | 0.1 | 83.9 | 84.3 |
| 2 | 1 | 86.7 | 0.1 | 86.5 | 86.8 |
| 2 | 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3 | 85.6 | 0.2 | 85.2 | 86.1 |
| 2 | 4 | 86.3 | 0.1 | 86.0 | 86.5 |
| 2 | 5 | 86.3 | 0.1 | 86.0 | 86.5 |
| 3 | 1 | 84.5 | 0.1 | 84.2 | 84.7 |
| 3 | 2 | 85.6 | 0.2 | 85.2 | 86.1 |
| 3 | 3 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 4 | 82.3 | 0.2 | 82.0 | 82.6 |
| 3 | 5 | 82.3 | 0.2 | 82.0 | 82.6 |
| 4 | 1 | 84.1 | 0.1 | 83.9 | 84.3 |
| 4 | 2 | 86.3 | 0.1 | 86.0 | 86.5 |
| 4 | 3 | 82.3 | 0.2 | 82.0 | 82.6 |
| 4 | 4 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 5 | 72.6 | 0.1 | 72.4 | 72.7 |
| 5 | 1 | 84.1 | 0.1 | 83.9 | 84.3 |
| 5 | 2 | 86.3 | 0.1 | 86.0 | 86.5 |
| 5 | 3 | 82.3 | 0.2 | 82.0 | 82.6 |
| 5 | 4 | 72.6 | 0.1 | 72.4 | 72.7 |
| 5 | 5 | 0.0 | 0.0 | 0.0 | 0.0 |
In general we see similar distances between the different pairs of bubbles. The only exception is the somewhat smaller distance between the bubbles 4 and 5. The tree is also devoit of structure, which makes sense as the input data is composed of cells derived from from 5 distinct cell lines. In case study B, we will analyze scRNA-seq data from PBMCs, where cell type structure is evident, i.e. we will see branches in the corresponding phylogenetic tree that are formed by transcriptionally related clusters of cells (e.g. CD4+ and CD8+ T-cells).
By default the function get_bubbletree annotates the branches of the bubbletree with their bootstrap support. This tells us the frequency with which a given branch in the tree is found the \(B\) bootstrap trees. Above we called get_bubbletree with B=100, and two of the tree branches have complete support (100 out of 100) while the branch that joins bubbles (4, 5) and bubble 3 has lower support (38 out of 100 => 76%). This tells us that the corresponding branch is not robust.
One way of visualizing the branch support is by using density trees. For this we can use function ggdensitree provided by ggtree and provide as input the \(B\) phylogenetic trees generated during the boostraping procedure and stored in the object k5:
ggtree::ggdensitree(k5$ph$boot_ph, alpha=0.1, colour='steelblue')+
geom_tiplab(size = 7)+
hexpand(ratio = 0.5)
Can we verify that this clustering is sensible?
We downloaded gene expression data for 1,019 human cancer cell lines from: https://www.ebi.ac.uk/gxa/experiments/E-MTAB-2770/Downloads and extracted data (gene TPM values) on 69 lung adenocarcinoma cell lines.
From this data we computed Euclidean distances between pairs of cell lines and generated a hierarchical clustering dendrogram with average linkage.
It turns out that A549 and HCC827 share a common branch in this dendrogram => similar gene expression profiles are accurately described by the bubbletree topology. Other cell lines are dispersed in different branches of the dendrogram and hence we do not see any special structure connecting the corresponding bubbles of the bubbletree.
tpm <- read.csv(file = "manuscript_data/E-MTAB-2770-query-results.tpms.tsv",
sep = "\t", comment.char = "#")
tpm <- tpm[, c(1, 2, which(regexpr(pattern = "lung\\.adenocarcinoma",
text = colnames(tpm)) != -1))]
rownames(tpm) <- tpm$Gene.ID
tpm$Gene.ID <- NULL
tpm$Gene.Name <- NULL
tpm <- t(tpm)
rownames(tpm) <- gsub(pattern = "\\.\\.papillary\\.lung\\.adenocarcinoma|\\..lung\\.adenocarcinoma|NCI\\.",
replacement = '', x = rownames(tpm))
rownames(tpm) <- gsub(pattern = "\\.",
replacement = '',
x = rownames(tpm))
# require(compositions)
# atpm <- compositions::acomp(tpm)
# euc_atpm <- dist(x = atpm, method = "euclidean")
# ah <- hclust(d = dist(atpm, method = "euclidean"), method = "average")
h <- hclust(d = dist(tpm, method = "euclidean"), method = "average")
five_cell_lines <- c("H838", "H2228", "A549", "H1975", "HCC827")
cols <- rep(x = "black", times = length(h$labels))
cols[which(h$labels %in% five_cell_lines)] <- "red"
pdf(width = 8.5, height = 4.5,
file = "manuscript_data/Fig_S1.pdf")
plot(as.dendrogram(h),
label.offset = 1, nodePar = list(lab.cex = 0.7, cex = 0.7,
pch = c(NA, 19)))
dev.off()
FALSE png
FALSE 2
plot(as.dendrogram(h),
label.offset = 1, nodePar = list(lab.cex = 0.7, cex = 0.7,
pch = c(NA, 19)))
Visualizing features of the cells in the different bubbles in the bubbletree can help us understand the biology behind different clusters (bubbles) as well as the structure of the tree. For instance, we may want to show the average expression of a certain marker gene in each bubble, or maybe the percent of mitochondrial gene content, both of which are numeric features.
Cells may also have categorical features. For instance: a) a cell may have predicted cell type label; b) a cell may be categorized as a singlet, doublet or multiplet; c) for a given cell we may have or have not found a B-cell receptor sequence in a corresponding immune profiling library.
In the next two paragraph we will explain how to ‘attach’ numeric and categorical features to the bubbletree using scBubbletree.
Categorical features can be ‘attached’ to the bubbletree using the function get_cat_feature_tiles. One of the categorical features in this case study are the predicted cell types. With get_cat_feature_tiles we can show make two types of visualization:
First, we can show the relative frequency distribution of a feature across the five bubbles (with parameter feature_composition=T). Second, we can show the cell type composition of each bubble (with parameter feature_composition=F):
w1 <- get_cat_feature_tiles(d = k5,
a = d@meta.data$cell_line_demuxlet,
feature_composition = T,
round_digits = 1,
rotate_x_axis = T)
scBubbletree uses the R-package patchwork to combine plots. Lets merge the bubbletree and the numeric features plot:
(k5$tree|w1$w)+patchwork::plot_layout(ncol = 2, widths = c(1, 1))
This plot tells us that each cells type is nearly exclusively (>99%) found in its own bubble, i.e. columns integrate to 100%. For instance, 99.76% of the A549 cells are found in bubble 3, 99.09% of the H1975 cells are found in bubble 5. Few cells are mixed between the bubbles. This mixing is also present in the 2D UMAP, however, doe to the heavy overplotting they are not visible. Our approach is more transparent in this sense, and show in a quantitative way the clustering output.
We might also be interested in checking the ‘purity’ of individual bubbles. For this we will draw a similar plot but make sure that the rows integrate to 100%:
w2 <- get_cat_feature_tiles(d = k5,
a = d@meta.data$cell_line_demuxlet,
feature_composition = TRUE)
(k5$tree|w2$w)+patchwork::plot_layout(ncol = 2, widths = c(1, 1))
The above feature plot informs us that 100% of the cells in bubble 5 belong to cell type H1975. About 0.17% of the 593 cells in bubble 4 belong to cell A549. This is equal to 0.17/100*593 = 1 cell, while the majority of cells 98.99 belong to cell HCC827. Hence, we see quite high bubble purity.
Using the Gini impurity index we can quantify the goodness of the clustering with k=5 and the impurity of the individual bubbles. This functionality is implemented by the function get_gini.
FALSE $cluster_gini
FALSE 4 2 5 1 3
FALSE 0.020099588 0.004563592 0.000000000 0.010474495 0.000000000
FALSE
FALSE $total_gini
FALSE [1] 0.006095786
All cluster-specific Gini impurity indices are close to 0 and thus also the total (global) impurity has a value close to 0 as well. This indicates nearly perfect clustering of cell types across bubbles. With this function, scBubbletree provides a quantitative way of summarizing the tile plots shown earlier.
scBubbletree also implement the function get_gini_boot, which allows us to integrate clustering data obtained from get_k with cell type predictions (labels), and to show quantitatively the Gini impurity index as a function of k. We can conclude that at k=5 all labels are nearly perfectly segmented across the bubbles, and each bubble contains cells exclusively from one cell type (1-to-1 mapping between bubbles and cell types as seen shown by the tile plots).
gini_boot <- get_gini_boot(labels = d@meta.data$cell_line_demuxlet,
kmeans_boot_obj = b)
g1 <- ggplot(data = gini_boot$total_gini_summary)+
geom_point(aes(x = k, y = total_gini_mean), size = 1)+
geom_errorbar(aes(x = k, y = total_gini_mean, ymin = L95,
ymax = H95), width = 0.1)+
ggtitle(subtitle = "Gini impurity", label = '')+
ylab(label = "Mean impurity")
g1
scBubbletree also implements add-ons for visualization of numeric cell features, such as gene expression, number of reads or features, mitochondrial content etc. Lets visualize the expression of five marker genes, i.e. one marker gene for each of the five cancer cell lines.
# First we need to select gene expressions for each cell and
# also for five marker genes
as <- as.matrix(t(d@assays$SCT@data[
rownames(d@assays$SCT@data) %in%
c("ALDH1A1",
"PIP4K2C",
"S100A9",
"PEG10",
"SLPI"), ]))
# 'as' is a matrix with n=rows for cells and a=columns for
# annotations (genes). The column names will be shown in
# the plot.
# We will order the columns in 'as' in the same way we want
# them to be plotted. These genes are markers for: A549,
# HCC827, H1975, H2228 and H838
as <- as[, c("ALDH1A1",
"PIP4K2C",
"S100A9",
"PEG10",
"SLPI")]
We can visualize numeric features in two ways.
First, we can show the feature averages (in this example: average gene expressions) in each bubble with get_num_feature_tiles. Lets invoke this function now:
# This function build the nummeric annotations plot
w3 <- get_num_feature_tiles(d = k5,
as = as,
plot_title = "",
round_digits = 1)
# Plot
(k5$tree|w3$w)+patchwork::plot_layout(widths = c(1, 1))
Second, we can visualize the distributions of the numeric features in each bubble as violins, while the cell-specific values are shown as jittered points. We can do this with get_num_feature_violins. This function uses the same input as get_num_feature_tiles:
w4 <- get_num_feature_violins(d = k5,
as = as,
plot_title = "",
scales = 'free_x',
show_cells = F)
(k5$tree|w3$w|w4$w)+patchwork::plot_layout(widths = c(1.5, 1.1, 2.5))
scBubbletree uses the R-package patchwork to combine ggplot2 and ggtree plots. This makes it convenient for users to ‘attach’ other plots that are generated by any of these package.
Lets combine the UMAP plots shown earlier with the output of scBubbletree:
scBubbletree is intended to promote simple and transparent analysis of scRNA-seq. The user is encouraged to optimize k and at each stage to perform feature annotation to understand the biology behind the new bubbles. At the end, user of scBubbletree must be able answer the following questions:
k biologically justifiable?We will provide answer to these questions in Case study B.
Wide range of clustering approaches are used to cluster scRNA-seq data. To accommodate for this scBubbletree implements the function dummy_bubbletree which can generate a bubbletree using segmentation of alternative approaches.
Lets do this using k-medoids clustering implemented as part of the R-package cluster:
FALSE Bubbletree construction ...
In this case study we will analyze scRNA-seq dataset of approximately 161,000 PBMCs derived from 8 healthy donors, collected at three timepoints and sequenced by 10x Genomics protocol. This is a large and complex scRNA-seq dataset, perfect for demonstrating the advantageS of scBubbletree over qualitative analyses e.g. based on 2D UMAPs.
In addition to gene expression data, this dataset reports for each cell a cell type prediction at three levels of resolution (l1, l2 and l3). The cell types have been predicted based on marker genes, and we will use them as categorical cell annotations. As this dataset is part of a larger CITE-seq experiment, the cell type predictions are made based on scRNA-seq derived gene expressions and also on the cell-surface protein expression levels of up to 228 marker proteins.
In essence, we will perform the same steps as in Case study A:
k \(\rightarrow\) with \(f:\) get_kget_bubbletreeget_cat_feature_tiles,get_num_feature_tiles,get_num_feature_violinsupdate_bubbletree, get_gini, get_gini_bootk and go back to 2.The raw gene expressions have already been preprocessed with Seurat. SCT transformation was used for normalization, followed by principal component analysis for dimensionality reduction. 50 lower-space features have been extracted and used to generate a 2D UMAP.
Lets load the data.
# To get the data used in this case study do the following steps:
# 1. download reference data from vignette:
# https://satijalab.org/seurat/articles/multimodal_reference_mapping.html
# 2. load SeuratDistk
# library(SeuratDisk, lib.loc = lib.loc)
# d <- LoadH5Seurat("~/BubbleMap/raw_data/Hao_2021/pbmc_multimodal.h5seurat")
# save(d, file = "raw_data/Hao_2021/Hao_2021.RData")
d <- get(load(file = "raw_data/Hao_2021/Hao_2021.RData"))
Lets look at the number of cells per donor at a given sampling time:
table(d@meta.data$time, d@meta.data$donor)
FALSE
FALSE P1 P2 P3 P4 P5 P6 P7 P8
FALSE 0 6443 5978 4698 5307 7020 6093 8909 8916
FALSE 3 5917 5714 5002 5793 6933 7583 8200 9203
FALSE 7 5775 5513 4960 5990 7858 7066 8762 8131
Also, we can show the number of cells per donor and predicted cell type at the lowest resolution (l1):
table(d@meta.data$celltype.l1, d@meta.data$donor)
FALSE
FALSE P1 P2 P3 P4 P5 P6 P7 P8
FALSE B 715 1335 925 1296 1798 3229 3472 1030
FALSE CD4 T 6500 5717 4721 5103 2925 4457 5429 6149
FALSE CD8 T 3233 3497 1623 3049 3605 1503 3232 5727
FALSE DC 334 276 250 333 668 379 904 445
FALSE Mono 5036 3500 4088 5188 8022 6627 9196 7353
FALSE NK 1590 1610 2423 688 2760 3357 2303 3933
FALSE other 491 210 180 154 952 506 576 373
FALSE other T 236 1060 450 1279 1081 684 759 1240
How many distinct cell types are there at each resolution?
length(unique(d@meta.data$celltype.l1))
FALSE [1] 8
length(unique(d@meta.data$celltype.l2))
FALSE [1] 31
length(unique(d@meta.data$celltype.l3))
FALSE [1] 58
Next we will show the 2D UMAP. Cells are color coded according to their cell type predictions resolution level l1 (left UMAP) and l2 (right UMAP):
# Lets show the generated 2D UMAP
UMAPPlot(object = d,
reduction = "umap",
group.by = 'celltype.l1',
pt.size = 0.25)|
UMAPPlot(object = d,
reduction = "umap",
group.by = 'celltype.l2',
pt.size = 0.25)
Now that the UMAP contains more than 161,000 cells from a heterogeneous tissue, we start to see the challenges associated with this approach:
massive overplotting
lack of compositional information
qualitative analysis
Now lets turn to scBubbletree. As earlier, our first goal is to determine k for the clustering of matrix \(A^{n \times f}\), which represents the low- dimensional feature space of the normalized gene expression matrix. We will use the first \(f\)=15 PCA dimensions as features. We selected \(f\)=15 based on the elbow plot below.
ElbowPlot(object = d, ndims = 50, reduction = "pca")
# This is the main input of scBubbletree
# -> matrix A with n=cells, f=features (from PCA)
A <- d@reductions$pca@cell.embeddings[, 1:15]
# meta data
meta <- d@meta.data
# quantitative features (gene expressions of marker genes) to be used later on:
# * GNLY, NKG7: NK cells
# * IL7R: CD4 T cells
# * CD8A: CD8 T cells
# * MS4A1: B cells
# * CD14, LYZ: CD14+ Monocytes
# * FCGR3A, MS4A7: FCGR3A+ Monocytes
# * FCER1A, CST3: Dendritic Cells
# * PPBP: Megakaryocytes
as <- as.matrix(t(d@assays$SCT@data[
rownames(d@assays$SCT@data) %in%
c("IL7R",
"CD14", "LYZ",
"MS4A1",
"CD8A",
"GNLY", "NKG7",
"FCGR3A", "MS4A7",
"FCER1A", "CST3",
"PPBP"), ]))
FALSE used (Mb) gc trigger (Mb) max used (Mb)
FALSE Ncells 8356105 446.3 12798815 683.6 12798815 683.6
FALSE Vcells 1191408814 9089.8 1777478561 13561.1 1243503274 9487.2
We will once again use the function get_k for clustering. Notice the modified parameter cv_index_p=0.2, for faster execution of this large dataset:
if(redo_case_b) {
# Determine appropriate number of clusters (k)
b <- get_k(B = 30,
cv_clust_p = 1, # use 100% for clustering
cv_index_p = 0.2, # use 20% for gap stat.
ks = 1:40,
x = A,
n_start = 10,
iter_max = 20,
cores = 20)
if(dir.exists("case_study_B")==F) {
dir.create("case_study_B")
}
save(b, file = "case_study_B/b.RData")
} else {
b <- get(load("~/scBubbletree/case_study_B/b.RData"))
}
We see a more complex silhouette curve. This is normal given the complexity in our dataset, which is composed of many cell types with high and low relative frequencies. We see a maximum silhouette index at k=5. The index then drops sharply until k=11, followed by shallow increase until k=14 after the silhouette starts once again to decrease.
Elbows are not easy to detect from the Gap and WSS curves. It is quite evident, however, that both curves flatten at around k=15 to k=20.
k=5Lets start with k=5 and use get_bubbletree to perform k-means clustering followed by hierarchical clustering:
if(redo_case_b) {
# Determine appropriate number of clusters (k)
k5 <- get_bubbletree(x = A,
k = 5,
n_start = 100,
iter_max = 100,
seed = 4321,
cores = 5,
B = 50,
N_eff = 100,
verbose = FALSE)
save(k5, file = "case_study_B/k5.RData")
} else {
k5 <- get(load("~/scBubbletree/case_study_B/k5.RData"))
}
Next, we can plot the bubbletree:
k5$tree
Does this make biological sense? Lets visualize the cell type compositions (figure B, notice parameter feature_composition = F) of the different bubbles using the categorical feature l1. Furthermore, we can also show the composition of features across the different bubbles (figure C, notice parameter feature_composition = T).
Summary of the bubbletree and figure B:
# cell type annotations (level 1)
a <- meta$celltype.l1
# create tile plot showing cluster compositions
w1 <- get_cat_feature_tiles(d = k5,
a = a,
feature_composition = F,
round_digits = 1,
show_hclust = F)
# create tile plot showing feature compositions
w2 <- get_cat_feature_tiles(d = k5,
a = a,
feature_composition = T,
round_digits = 1,
show_hclust = F)
(k5$tree|w1$w|w2$w)+
patchwork::plot_layout(widths = c(1, 1.5, 1.5))+
patchwork::plot_annotation(tag_levels = 'A')
Earlier we stated that the data is derived from 8 donors at 3 timepoints. Are the cells in each sample well mixed in each of the different bubbles?
We see interesting patterns in the feature compositions between the samples as shown in figure D and the corresponding dendrogram with two branches. For instance bubble 3 cells (CD4+ and CD8+ T-cells) appear to be enriched in P4 samples. We do not see systematic biases between samples that correspond to different timepoints.
# lets first create a variable showing pairs (donor x timepoint)
a <- paste0(meta$donor, '_', meta$time)
# categorical donor meta data
w3 <- get_cat_feature_tiles(d = k5,
a = a,
feature_composition = T,
round_digits = 0,
show_hclust = T)
((k5$tree|w1$w|w2$w)+patchwork::plot_layout(widths = c(1, 1.5, 1.5)))/(w3$w)+
patchwork::plot_annotation(tag_levels = 'A')
Another categorical feature which we can visualized are the predicted phases of the different cells: G1 (cell growth), S (DNA replication), G2 (growth and preparation for mitosis), M (mitosis: ell division occurs). These prediction are included as part of the meta data.
In the next plot we show the composition of each bubble with respect to these three cell phases:
# cell type annotations (level 1)
a <- meta$Phase
# create tile plot
w4 <- get_cat_feature_tiles(d = k5,
a = a,
feature_composition = T,
round_digits = 1)
((k5$tree|w1$w|w2$w)+patchwork::plot_layout(widths = c(1, 1.5, 1.5)))/
((w3$w|w4$w)+patchwork::plot_layout(widths = c(3, 1)))+
patchwork::plot_annotation(tag_levels = 'A')
k=15It is apparent from the results generated by setting k=5 that some bubbles are impure, i.e. they contain cells from different cell types, e.g. bubble 2 contains a mixture of two unrelated cell types, i.e. B-cells and DCs.
Furthermore, our goal is to inspect the data at a finer cellular resolution. For instance, lets see how well k=5 clustering segments the l2 cell type predictions?
As expected, many cluster contains multiple cell types \(\rightarrow\) if we want to identify cell clusters at l2 resolution it is evident that we need to increase k. Lets try k=15 (this is also a reasonable value for k based on silhouette, WSS and gap curves shown above).
# cell type annotations (level 1)
a <- meta$celltype.l2
# create tile plot showing cluster compositions
w1 <- get_cat_feature_tiles(d = k5,
a = a,
feature_composition = T,
round_digits = 0,
show_hclust = F)
(k5$tree|w1$w)+
patchwork::plot_layout(widths = c(1, 4))+
patchwork::plot_annotation(tag_levels = 'A')
if(redo_case_b) {
# do clustering with k=15
k15 <- get_bubbletree(x = A,
k = 15,
n_start = 200,
iter_max = 1000,
seed = 4321,
cores = 5,
B = 50,
N_eff = 100,
verbose = FALSE,
round_digits = 1)
save(k15, file = "case_study_B/k15.RData")
} else {
k15 <- get(load("~/scBubbletree/case_study_B/k15.RData"))
}
Summary of the bubbletree:
# cell type annotations (level 2)
a <- meta$celltype.l2
# create tile plot
w1 <- get_cat_feature_tiles(d = k15,
a = a,
feature_composition = T,
round_digits = 0,
show_hclust = F)
(k15$tree|w1$w)+patchwork::plot_layout(ncol = 2, widths = c(3, 9))
Interestingly, the bubbles 1 and 2 contain many different cell types. Additional clustering of these bubbles might be necessary.
scBubbletree can perform targeted clustering of cells within specific bubbles. This can be achieved with the function update_bubbletree. Lets update bubbles 1 and 2 into two sub-bubbles.
Important remark: the function update_bubbletree should be used to explore the clustering in order to select an appropriate k and to explain the segmentation.
if(redo_case_b) {
# do clustering with k=15
u_k15 <- update_bubbletree(btd = k15,
updated_bubbles = c("2", "1"),
ks = c(2, 2),
cores = 20)
save(u_k15, file = "case_study_B/u_k15.RData")
} else {
u_k15 <- get(load("~/scBubbletree/case_study_B/u_k15.RData"))
}
# create tile plot
w2 <- get_cat_feature_tiles(d = u_k15,
a = a,
feature_composition = T,
round_digits = 0,
show_hclust = F)
(u_k15$tree|w2$w)+patchwork::plot_layout(ncol = 2, widths = c(4, 9))
Can we show quantitatively that by increasing k we get “better” clustering in a semi-supervised way? Yes, we can use the Gini impurity index.
For this we will integrate the results obtained by function get_k with l1, l2 and l3 cell type predictions (labels), and show quantitatively the change in Gini index as a function of k. This is done with the function get_gini_boot. One potential caveat of this approach is that some of the predictions might be inaccurate.
Lets invoke the function get_gini_boot and pass the object b obtained earlier together with the cell type predictions:
b <- get(load("~/scBubbletree/case_study_B/b.RData"))
gini_l1 <- get_gini_boot(labels = meta$celltype.l1,
kmeans_boot_obj = b)
gini_l2 <- get_gini_boot(labels = meta$celltype.l2,
kmeans_boot_obj = b)
gini_l3 <- get_gini_boot(labels = meta$celltype.l3,
kmeans_boot_obj = b)
We next plot the Gini scores for the labels at resolution l1, l2 and l3.
Summary of the above plot:
k approaches 20.Gene expression annotations can also be integrated to better explain the bubbles and the tree structure. Lets visualize the mean gene expression of some marker genes in the bubbles:
# This function build the nummeric annotations plot
w3 <- get_num_feature_tiles(d = k15,
as = as,
plot_title = "",
round_digits = 1)
# Plot
(k15$tree|w3$w)+patchwork::plot_layout(widths = c(1, 1))
Second, we can visualize the distribution of each marker gene in each bubble using violin plots with get_num_feature_violins. This function uses the same input as get_num_feature_tiles.
Lets invoke this function now.
w4 <- get_num_feature_violins(d = k15,
as = as,
plot_title = "",
scales = 'free_x')
((k15$tree|w3$w)+patchwork::plot_layout(widths = c(1.5, 3)))/w4$w
Tian, Luyi, et al. “Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments.” Nature methods 16.6 (2019): 479-487.↩︎
Hao, Yuhan, et al. “Integrated analysis of multimodal single-cell data.” Cell 184.13 (2021): 3573-3587.↩︎
https://satijalab.org/seurat/articles/multimodal_reference_mapping.html↩︎